Abstract

The advent of social media has provided an extraordinary, if imperfect, big data window into the form and evolution of social 
networks. Based on nearly 40 million message pairs posted to Twitter between September 2008 and February 2009, we construct 
and examine the revealed social network structure and dynamics over the time scales of days, weeks, and months. At the level of 
user behavior, we employ our recently developed hedonometric analysis methods to investigate patterns of sentiment expression. 
We find users' average happiness scores to be positively and significantly correlated with those of users one, two, and three links
away. We strengthen our analysis by proposing and using a null model to test the effect of network topology on the assortativity of
happiness. We also find evidence that more well connected users write happier status updates, with a transition occurring around 
Dunbar's number. More generally, our work provides evidence of a social sub-network structure within Twitter and raises several 
methodological points of interest with regard to social network reconstructions. 
1. Introduction

Social network analysis has a long history in both theoretical
and applied settings [1]. During the last 15 years, and driven
by the increased availability of real-time, in-situ data reflect- 
ing people's social interactions and choices, there has been an 
explosion of research activity around social phenomena, and 
many new techniques for characterizing large-scale social net- 
works have emerged. Numerous studies have examined the 
structure of online social networks in particular, such as blogs,
Facebook, and Twitter [2-19].

In a series of analyses of the Framingham Heart Study data
and the National Longitudinal Study of Adolescent Health,
Christakis, Fowler, and others have examined how qualities
such as happiness, obesity, disease, and habits (e.g., smoking)
are correlated within social network neighborhoods [20-25].
The authors' additional assertion of contagion, however, has
been criticized, primarily because of the difficulty of
distinguishing contagion from homophily [26-28]. The observation
that social networks exhibit assortativity with respect to these
traits evidently requires further study
and leads us to explore potential mechanisms. Advances would 
naturally provide further insight into the nature of how social 
groups influence individual behavior and vice versa. 

Our focus in the present work is the social network of Twitter 
users. With the abundance of available data, Twitter serves as a
living laboratory for studying contagion and homophily [29].
As a requisite step towards these goals, we first define sub- 
networks of Twitter users suitable to such study and, sec- 
ond, examine whether assortativity is observed in these sub- 
networks. Before describing our methods, we provide a brief 
overview of Twitter, related work, and the challenges associ- 
ated with social network analysis in this arena. 

Twitter is an online, interactive social media platform in 
which users post tweets, micro-blogs with a 140 character limit. 
Since its inception in 2006, Twitter has grown to encompass 
over 200 million accounts, with over 100 million of these ac- 
counts currently active as of October 2011, and with some 
users having garnered over 10 million followers [30]. Tweets
are open online by default, and are also broadcast directly to 
a user's followers. Users may express interest in a tweet by 
retweeting the message to their followers. Alternatively, fol- 
lowers may reply directly to the author. 

Understanding the topology of the Twitter network, the man- 
ner in which users interact and the diffusion of information 
through this media is challenging, both computationally and 
theoretically. One of the central issues in characterizing the 
topology of any network representation of Twitter lies in defin- 
ing the criteria for establishing a link between two users. 
The majority of previous studies have examined the topology
of, and information cascades on, the Twitter follower network
[7, 10, 18], as well as networks derived from mutual following
[8]. However, the follower network is not the only
representation of Twitter's social network, and its structure can be
misleading [31]. For example, in a study of over 6 million
users, Cha et al. [10] found that users with the highest follower
counts were not the users whose messages were most frequently
retweeted. This suggests that such popular users (as measured
by follower count) may not be the most influential in terms of
spreading information, and it calls into question the extent to
which users are influenced by those they follow [32]. Of
further concern is the finding of low reciprocity within follower
networks: Kwak et al. found very few individuals who followed
their followers [18]. As a result, inferring meaningful
influence and contagion in such a network is difficult.




(a) Followers (b) Interaction 



Figure 1: (a) Follower network: The follower network is generated by declared
following choices, absent any messages being sent. If user v_i broadcasts tweets
to followers v_j, v_k, and v_l (represented by the dashed, blue arrow), v_i would be
connected to each of v_j, v_k, and v_l by a directed link in a follower network.
(b) Reciprocal-reply network: Directed replies are represented by a solid black
arrow. When considering the interaction between users, a reply (i.e., v_l replies
to v_i) provides evidence of a directional interaction between nodes. We mandate
a stronger condition for interaction, namely reciprocal replies (i.e., v_j replies to
v_i and vice versa) over a given time period. Thus v_i and v_j are connected in the
reciprocal-reply network that we construct.

While popular users and their many followers clearly exhibit
an affiliation, they do not necessarily interact, as there are different
relationships implicated by broadcasting (tweeting), sending
a message (@someone), and replying to a message. As an
example, we consider a user represented by node v_i which has
three followers, represented by v_j, v_k, and v_l, as shown in Fig.
1(a). When a user broadcasts tweets to their many followers, as
represented by the directed arrow in Fig. 1(a), this does not imply
that followers read or respond to these tweets. Followers v_j, v_k,
and v_l receive all tweets broadcast by node v_i, but this provides
no guarantee of interaction. Suppose, though, that we observe
that v_l replies to v_i as shown in Figure 1(b). This provides
evidence (but not proof) that the user represented by v_l has indeed
received a tweet from v_i and is sufficiently motivated to create
a response to v_i. Although a directional network based on these
replies can be created, such a directional interaction
does not suggest reciprocity between the nodes. In this
example, we have no evidence that v_i has, in any way, considered or
even read such a response from his/her follower.

We conclude that following and unreciprocated replies are
not sufficient for interaction and present an alternative means
by which to derive a social network from Twitter messages, via
reciprocal replies. In our reciprocal-reply network, two nodes,
v_i and v_j, are connected if v_i has replied to v_j and v_j has replied
to v_i at least once within a given time period of consideration.
In Figure 1(b), the nodes v_i and v_j meet this criterion.
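
The reciprocity criterion can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the replies observed in one time window are available as (replier id, original user id) pairs, and the function names `reciprocal_reply_links` and `adjacency` are hypothetical.

```python
from collections import defaultdict

def reciprocal_reply_links(replies):
    """Return undirected links {u, v} such that u replied to v and v replied to u.

    `replies` is an iterable of (replier_id, original_user_id) pairs observed
    within one time window (an assumed input format).
    """
    directed = set()
    for replier, original in replies:
        if replier != original:  # ignore users replying to themselves
            directed.add((replier, original))
    links = set()
    for u, v in directed:
        if (v, u) in directed:  # keep the pair only if the reply is reciprocated
            links.add(frozenset((u, v)))
    return links

def adjacency(links):
    """Build a node -> set-of-neighbours map from the undirected links."""
    adj = defaultdict(set)
    for link in links:
        u, v = tuple(link)
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

# v_l replies to v_i without reciprocation; v_i and v_j reply to each other.
replies = [("v_i", "v_j"), ("v_j", "v_i"), ("v_l", "v_i"), ("v_k", "v_k")]
links = reciprocal_reply_links(replies)
adj = adjacency(links)
```

In this toy window only v_i and v_j end up connected, mirroring Figure 1(b); the unreciprocated reply from v_l and the self-reply from v_k create no links.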

Another challenge in characterizing the topology of any
network representation of Twitter concerns determining how long
a link between two users in the network should persist. Including
stale user-user interactions in the network creates
an inaccurate portrayal of the current state of the system;
this is typically referred to as the "unfriending problem" [26].
Not only will network statistics such as the number of nodes,
average degree, maximum degree, and proportion of nodes in
the giant component be artificially inflated due to superfluous,
no-longer-active links [26, 33], but the degree distribution will
also be distorted. Kwak et al. [18] found that the degree
distribution for a Twitter follower network deviated from a power law
distribution due to an overabundance of high degree nodes
resulting from an accumulation of "dead-weight" in the network.

Additional problems are encountered if one uses accumu- 
lated network data to measure assortativity with respect to a 
trait (e.g., happiness). As an example, consider a network in 
which two users are connected because they interacted during 
the last week of a year-long study. Including this user-user pair 
in the list of pairs to compute assortativity for the entire network 
blurs the relationship between more consistent and repeated in- 
teractions that occurred throughout the timespan of the study. 
Further complications arise when averaging a user's trait over 
a large time scale (i.e., averaging happiness over a 6 month or 
12 month timespan). Detecting changes in users' traits over 
time and how these may (or may not) be correlated with near- 
est neighbors' traits is of fundamental importance; accumulated 
network data occludes exactly the interactions we are looking 
to understand. Recognizing that, due to practical limitations, 
accumulation of network data must occur on some scale, we an- 
alyze users in day, week, and month reciprocal reply networks. 
By examining networks constructed at smaller time scales and 
calculating users' happiness scores based on tweets made only 
during that time period, we aim to take a more dynamic view of 
the network. 

In addition to defining reciprocal reply networks and advo- 
cating for their use, we also seek to describe how happiness is 
distributed in the reciprocal reply networks of Twitter. Previ- 
ous hedonometric work with Twitter data has revealed cycli- 
cal fluctuations in average happiness at the level of days and 
weeks, as well as spikes and troughs over a time scale of years 
corresponding to events such as U.S. Presidential Elections, the 
Japanese tsunami and major holidays LiL^ . 35] . Other studies 
have examined changes in valence of tweets associated with the 
death of Michael Jackson [14], changes in the U.S. Stock Market
[9], the Chilean Earthquake of 2010, and the Oscars [16].
In the present work, we seek to understand localized patterns of 
happiness in the Twitter users' social network. 

Understanding how emotions are distributed through social 
networks, as well as how they may spread, provides insight into 
the role of the social environment on individual emotional states 
of being, a fundamental characteristic of any sociotechnical 
system. Bollen et al. [8] examine a reciprocal-follower network
using Twitter and suggest that Subjective Well-Being (SWB),
a proxy for happiness, is assortative. Building on their work,
we address whether happiness is assortative in reciprocal-reply
networks. We also test the hypothesis of Christakis and Fowler
[25], who find evidence that the assortativity of happiness may
be detected up to three links away. In doing so, we raise an
additional point which is not specific to Twitter networks, but
rather relates to empirical measures of assortativity in general.
Relatively few studies have employed a null model for calcu- 
lating the pairwise correlations (e.g., happiness-happiness). We 
devise a null model which maintains the topology of the net- 
work and randomly permutes happiness scores attached to each 
node. By randomly permuting users' happiness scores, we can 
detect what effect, if any, network structure has on the pairwise 
correlation coefficient. 
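
A minimal sketch of such a permutation null model follows. It is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the Spearman coefficient is implemented as the Pearson coefficient on tie-averaged ranks, and each undirected link is counted in both orderings so that the measure is symmetric (the paper does not say how pairs are ordered).

```python
import random
from statistics import mean, pstdev

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def link_correlation(links, score):
    """Spearman correlation of scores at the two ends of every link."""
    xs, ys = [], []
    for u, v in links:
        xs += [score[u], score[v]]  # assumption: count both orderings
        ys += [score[v], score[u]]
    return spearman(xs, ys)

def null_model(links, score, n_perm=100, seed=0):
    """Keep the links fixed, shuffle scores over nodes, and return the
    mean and standard deviation of the resulting correlations."""
    rng = random.Random(seed)
    nodes = list(score)
    values = [score[n] for n in nodes]
    out = []
    for _ in range(n_perm):
        rng.shuffle(values)
        out.append(link_correlation(links, dict(zip(nodes, values))))
    return mean(out), pstdev(out)

# Toy chain in which neighbouring nodes have similar scores.
links = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
score = {n: 5.0 + 0.1 * n for n in range(6)}
observed = link_correlation(links, score)
null_mean, null_sd = null_model(links, score)
```

Comparing `observed` with `null_mean` and `null_sd` indicates whether the measured correlation exceeds what the topology alone would produce; the same function can be applied to lists of node pairs two or three links apart.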

We organize our paper as follows: In Section 2, we de- 
scribe our data set, the algorithm for constructing reciprocal- 
reply networks, network statistics used for characterizing the 
networks, and our measure for happiness. We propose an alter- 
native means by which to detect social structure and argue that 
our method detects a large social sub-network on Twitter. In 
Section 3, we describe the structure of this network, the extent 
to which it is assortative with respect to happiness and the re- 
sults of testing assortativity against a null model. In Section 4, 
we discuss these findings and propose further investigations of 
interest. 

2. Methods 

2.1. Data

From September 2008 to February 2009, we retrieved over
100 million tweets from the Twitter streaming API service.^1
While the volume of our feed from the Twitter API increased
during this study period, the total number of tweets grew at a
faster rate (Fig. 2). During this time period, we estimate that we
collected roughly 38% of all tweets.^2 The number of messages
and percent of which were replies are reported in Table 
For the remainder of this paper, we restrict our attention to the 
nearly 40 million message-reply pairs within this data set and 
the users who authored these tweets. 

The data received from the Twitter API service for each tweet 
contained separate fields for the identification number of the 
message (message id), the identification number of the user who 
authored the tweet (user id), the 140 character tweet, and
several other geo-spatial and user-specific metadata. If the tweet
was made using Twitter's built-in reply function,^3 the identification
number of the message being replied to (original message
id) and the identification of the user being replied to (original
user id) were also reported.

We acknowledge two sources of missing data. First, the Twit- 
ter API did not allow us access to all tweets posted during the 6 



^1 Data was received in XML format.

^2 We calculated the total number of messages as the difference between the
last message id and the first message id that we observe for a given week. This
provides a reasonable estimate of the number of tweets made per week, as
message ids were assigned (by Twitter) sequentially during the time period of this
study.

^3 Twitter has a built-in reply function with which users reply to specific
messages from other users. Tweets constructed using Twitter's reply function begin
with '@username', where 'username' is the Twitter handle of the user being
replied to; the user and message ids of the tweet being replied to are included in
the reply message's metadata from the Twitter API. Users often informally
reply to or direct messages to other users by including said users' Twitter handles
in their tweets. In such cases, however, no identification information about the
"mentioned" user is included in the API parameters for these tweets (only their
Twitter handle is) and we exclude such exchanges when building the
reciprocal-reply network.
Figure 2: Tweet counts are plotted for the weeks between September 2008 and
February 2009. The three curves represent the total, those that we observed and 
the number of the observed tweets that constituted replies. 

month period under consideration. Thus, there are replies that
we have not observed. As a result, some users may remain
unconnected or connected by a path of longer length due to
missing intermediary links in our reciprocal-reply network (Fig. 3).
Second, we acknowledge that users may be interacting with
each other without using the built-in reply function. We discuss
this further in the next section.
Figure 3: The effect of missing links in the reciprocal-reply network is
depicted, where observed links are shown as solid lines and an unobserved link
is shown as a dashed line. The effect of unobserved links is twofold: (1) some
connections between nodes are missed (e.g., v_j and v_l are not connected in the
observed reciprocal-reply network); and (2) some path lengths between nodes
are artificially inflated (e.g., the distance from v_i to v_l is 3 in the observed
reciprocal-reply network, whereas in reality the path length is 2).



2.2. Reciprocal-reply network 

In keeping with terminology used in the field of complex
networks, the terms nodes and links will be used henceforth to
describe users and their connections. Define G = (V, E) to be a
simple graph which contains N = |V| nodes and M = |E| links.
We construct the reciprocal-reply networks in which users are
represented by nodes, v_i ∈ V, and links connecting two nodes,
e_ij ∈ E, indicate that v_i and v_j have made replies to each other
during the period of time under analysis (Fig. 1). For each
network, we remove self-loops (i.e., users who responded to
themselves). We characterize the reciprocal-reply network for
each week by the calculation of network statistics such as N
(the number of nodes), ⟨k⟩ (average degree), k_max (maximum
degree), the number of connected components, and S (proportion
of nodes in the giant component). We calculate clustering,
C_G, according to Newman's global clustering coefficient [36]:

$$C_G = \frac{3 \times (\text{number of triangles on a graph})}{\text{number of connected triples of nodes}}$$
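
As a hedged illustration of this definition, the coefficient can be computed directly from neighbour sets. The function name `global_clustering` and the dictionary-of-sets representation are assumptions made for this sketch.

```python
from itertools import combinations

def global_clustering(adj):
    """Global clustering coefficient of an undirected simple graph.

    `adj` maps each node to the set of its neighbours. A connected triple
    is a pair of distinct neighbours of a common centre node; it is closed
    when those two neighbours are themselves linked. Every triangle closes
    three triples, which accounts for the factor 3 in the definition.
    """
    triples = 0
    closed = 0
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

# A triangle (1, 2, 3) with one pendant node attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
c_g = global_clustering(adj)  # 3 closed triples out of 5
```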

Assortativity refers to the extent to which similar nodes are
connected in a network. Often, degree assortativity is quantified
by computing the Pearson correlation coefficient of the degrees
at each end of links in the network [37]. Since we are interested
in quantifying the extent to which the highest degree nodes are
connected to other high degree nodes, as defined by the rank
of their degrees, we instead measure degree assortativity by the
Spearman correlation coefficient.^4 Thus for each link that
connects nodes v_i and v_j, we examine the ranks of k_{v_i} and k_{v_j}. The
Spearman correlation coefficient, which is the Pearson correlation
coefficient applied to the ranks of the degrees at each end
of links in the network, is a non-parametric measure that does not
rely on normally distributed data and is much less sensitive to
outliers.^5

In addition, we also investigate user pairs which are
connected by a minimal path length of two (or three) in the
reciprocal-reply networks. We define d(v_i, v_j) to be the path length
(i.e., number of links) between nodes v_i and v_j such that no
shorter path exists. As a consequence of missing messages, we
recognize that some users will appear to remain unconnected or
connected by a path of longer length. Figure 3 depicts the effect
of missing links on inferred path lengths between nodes in the
network. Nodes v_j and v_l are adjacent in the network; however,
due to the missing link represented by the dashed line, these
nodes are inferred to be two links apart.
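
Pairs at a minimal path length of two or three can be found with a depth-limited breadth-first search. The sketch below is illustrative only; the function names and the adjacency representation are assumptions, and the toy example simply shows how one unobserved link inflates an inferred distance.

```python
from collections import deque

def distances_from(adj, source, max_depth=3):
    """Shortest path lengths from `source`, explored up to `max_depth` links."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def pairs_at_distance(adj, d):
    """Unordered node pairs whose shortest path length is exactly d."""
    pairs = set()
    for u in adj:
        for v, duv in distances_from(adj, u, max_depth=d).items():
            if duv == d:
                pairs.add(frozenset((u, v)))
    return pairs

# Full network: a triangle a-b-c. Observed network: the a-c link is missing.
full = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
observed = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
true_d = distances_from(full, "a")["c"]          # adjacent in reality
inferred_d = distances_from(observed, "a")["c"]  # appears two links apart
```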

2.3. Measuring happiness 

To quantify happiness for Twitter users, we apply the
real-time hedonometer methodology for measuring sentiment
in large-scale text developed in Dodds et al. [11]. In that
study, the 5000 most frequently used words from Twitter,
Google Books (English), music lyrics (1960 to 2007), and
the New York Times (1987 to 2007) were compiled and
merged into one list of 10,222 unique words.^6 This word
list was chosen solely on the basis of frequency of usage
and is independent of any other presupposed significance of
individual words. Human subjects scored these 10,222 words
on an integer scale from 1 to 9 (1 representing sad and 9
representing happy) using Mechanical Turk. We compute the



^4 We present both the Spearman and Pearson correlation coefficients in the
Appendix, Figure A2. Pearson's correlation coefficient is more sensitive to
extreme values and thus obscures the trend in the data, namely that the network
is assortative with respect to the rank (i.e., ordering) of nodes' degrees.
^5 Our degree distribution is not Gaussian, as can be seen from Figure 7.
^6 We provide a brief summary of this methodology here and refer the
interested reader to the original paper for a full discussion. The supplementary
information contains the full word list, along with happiness averages and
standard deviations for these words [11].




Figure 4: The happiness scores of words are plotted as a function of their rank
(dots), with the stop words (words within ±Δh = 1 of h_avg = 5) depicted in light
grey [38]. These words were excluded from the happiness score computation.
The frequency of words and their rank (1 = most frequent, 9956 = least frequent)
are plotted (solid curve). Not all 10,222 labMT words were observed during
the time period from September 2008 to February 2009.

average happiness score (h_avg) to be the average score from 50
independent evaluations. Examples of such words and their
happiness scores are: h_avg(love) = 8.42, h_avg(special) = 7.20,
h_avg(house) = 6.34, h_avg(work) = 5.24, h_avg(sigh) = 4.16,
h_avg(never) = 3.34, h_avg(sad) = 2.38, h_avg(die) = 1.74. Words
that lie within ±Δh_avg = 1 of h_avg = 5 were defined as "stop
words" and excluded to sharpen the hedonometer's resolution.^7
The result is a list of 3,686 words, hereafter referred to as
the Language Assessment by Mechanical Turk (labMT) word
list [11]. See Tables A1 and A2 for additional example word
happiness scores.
Table 1: Happiness scores are computed as a weighted average of words' h_avg
scores. Since "starts" is a stop word, it is not included in the calculation of
h_avg(T) = 7.07. This example is included to illustrate the
methodology; in practice, the average is calculated over a much larger word
set.

Figure 4 presents word happiness as a function of usage rank
for the roughly 10,000 words in the labMT data set. This fig- 
ure reveals a frequency independent bias towards the usage of 
positive words (see [37] for further discussion of this positivity 
bias). Proceeding with the labMT word list, a pattern-matching 
script evaluated each tweet for the frequency of words. We 
compute the happiness of each user by applying the hedo- 
nometer to the collection of words from all tweets authored 
by the user during the given time period. Note that each 



^7 For notational convenience, we henceforth use Δh in lieu of Δh_avg.



Figure 5: A visualization of the 162,445 nodes in the reciprocal reply network for the week beginning December 9, 2008 (Week 14) is depicted. Node colors 
represent connected components, a total of 15,342, with the giant component (shown in blue) comprising 76% of all nodes. The size of each node is proportional to
its degree. The visualization was made using Gephi [39].
Figure 6: Network statistics for the reciprocal-reply network are constructed at the scale of days (green), weeks (blue), and months (red). (A.) The number of users
(N) engaged in reciprocal exchanges when viewed at the level of days, weeks, or months increases over the study period. (B.) The average degree (⟨k⟩) remains fairly
constant throughout the study period, with higher values detected for larger interaction time periods. (C.) The maximum degree (k_max) shows variability throughout
the study period. (D.) Clustering decreases, most likely because the number of closed triangles does not keep up with the growing number of nodes.
(E.) Degree assortativity remains fairly constant throughout the study period, and shows little sensitivity to the time period over which the networks represent
interactions. (F.) The proportion of nodes in the giant component (S) remains fairly constant for week and month networks, but shows some variability during
the first month of the study for day networks.



user's collection of words likely reflects messages that were
not replies. The happiness of this collection of words is taken
to be the frequency-weighted average of happiness scores for
each labMT word as

$$h_{\mathrm{avg}}(T) = \frac{\sum_{i} h_{\mathrm{avg}}(w_i)\, f_i}{\sum_{i} f_i} = \sum_{i} h_{\mathrm{avg}}(w_i)\, p_i,$$

where h_avg(w_i) is the average happiness of the ith word appearing
with frequency f_i and where p_i is the normalized frequency
(p_i = f_i / Σ_j f_j). As a simple example, we consider the
phrase "Vacation starts today, yeahhhhh!" in Table 1. In practice,
though, the hedonometer is applied to a much larger word
set and is not applied to single sentences.
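
The frequency-weighted average can be sketched as follows. This is an illustration, not the authors' pattern-matching script: the tiny lexicon contains only the example scores quoted above, the tokenizer is a simple assumption, and treating a word as a stop word when |h_avg - 5| < Δh (strict inequality) is also an assumption, since the text does not specify the boundary case.

```python
import re
from collections import Counter

# Example scores quoted in the text; a real analysis would load the full labMT list.
H_AVG = {"love": 8.42, "special": 7.20, "house": 6.34, "work": 5.24,
         "sigh": 4.16, "never": 3.34, "sad": 2.38, "die": 1.74}

def happiness(text, scores=H_AVG, delta_h=1.0, center=5.0):
    """Return (h_avg(T), number of scored words) for a body of text.

    Words closer than `delta_h` to `center` are treated as stop words and
    skipped. Returns (None, 0) when no scored word remains.
    """
    words = re.findall(r"[a-z']+", text.lower())  # hypothetical tokenizer
    counts = Counter(w for w in words
                     if w in scores and abs(scores[w] - center) >= delta_h)
    total = sum(counts.values())
    if total == 0:
        return None, 0
    h = sum(scores[w] * f for w, f in counts.items()) / total
    return h, total

# "work" (5.24) is a stop word for delta_h = 1, so only love (x2) and sad count.
h, n_words = happiness("Love love sad work")
h_all, n_all = happiness("Love love sad work", delta_h=0.0)
```

The second return value gives the count of scored words, which is the quantity a per-user word threshold would be applied to.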

Having found happiness scores for each node (user), we then
form happiness-happiness pairs (h_{v_i}, h_{v_j}), where h_{v_i} and h_{v_j}
denote the happiness of nodes v_i and v_j connected by an edge. The
Spearman correlation coefficient of these happiness-happiness
pairs measures how similar individuals' average happiness is
to that of their nearest neighbors. Lastly, we investigate the
strength of the correlation between users' average happiness
scores and those of other users in the two and three link
neighborhoods.

3. Results 

3.1. Reciprocal-reply network statistics 

Visualizations of day and week networks were created
using the software package Gephi [39]. Figures 5 and A6 show
a sample week and day network, respectively. All layouts were
produced using the Force Atlas 2 algorithm, which is a
spring-based algorithm that plots nodes together if they are highly
connected (see [40] for more details). The sizes of the nodes are
proportional to their degrees.

Network statistics, such as the number of nodes (N), the
average degree (⟨k⟩), the maximum degree (k_max), global clustering
(C_G), degree assortativity (Assort), and the proportion of nodes
in the giant component (S), are summarized in Figure 6. Several
trends are apparent.

Throughout the course of the study, the number of users
in the observed reciprocal-reply network increases,
Figure 7: Log-log plot of the complementary cumulative distribution function
(CCDF) of the degree distribution for a sample week (week of January 27,
2009) network is shown (blue), along with the best fitting power law model
(α = 3.50 and k_min = 34) using the procedure of Clauset, Shalizi, and Newman
[41]. We test whether the empirical distribution is distinguishable from a power
law using the Kolmogorov-Smirnov test and find no evidence against the null
hypothesis (D = 2.28 x 10'^, p = 0.095, n = 203852).



whereas the average degree, degree assortativity, and proportion
of nodes in the giant component remain fairly constant.
The fluctuations in maximum degree are the result of celebrities
or companies having bursts of high volume reply exchanges
with their fans during a particular week, for example
Bob Bryar, drummer for the band My Chemical Romance
(k_max = 1244, Week 12), Namecheap domain registration company
(k_max = 1245, Week 13), Twitter's own Shorty Awards
(k_max = 1456, Week 14), and Stephen Fry, actor and
mega-blogger (k_max = 1718, Week 22). This observation highlights
the importance of examining network data on the appropriate
time scale; otherwise, information about these kinds of dynamics
would be lost. The clustering coefficient shows a slight
decrease over the course of this period. This is most likely due
to an increasing number of nodes, which results in a smaller
proportion of closed triangles in the network.

The degree distribution, P_k, for a sample week (week beginning
January 27, 2009) is presented in Figure 7. Using the approach
outlined by Clauset, Shalizi, and Newman [41], we find
a lower bound for the scaling region to be k_min = 34 and a very
steep scaling exponent of α = 3.5. This suggests a constrained
variance and mean. We test whether the empirical distribution
is distinguishable from a power law using the Kolmogorov-Smirnov
test and find no evidence against the null hypothesis
for the week (D = 2.28 x 10-^, p = 0.095, n = 203852). We find
the same exponent and statistically stronger evidence of a power
law for a sample month (see the Appendix, Fig. A1). This
suggests that these distributions' tails may be fit by a power law.
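
For readers who want to reproduce the exponent estimate, the sketch below implements only one ingredient of the Clauset, Shalizi, and Newman procedure: their approximate maximum-likelihood estimator for discrete data, α ≈ 1 + n / Σ ln(k_i / (k_min - 1/2)), evaluated for a given k_min. Choosing k_min and computing the goodness-of-fit p-value are not shown, and the function name `powerlaw_alpha` is hypothetical.

```python
import math

def powerlaw_alpha(degrees, k_min):
    """Approximate MLE of the power-law exponent for the tail k >= k_min.

    Uses the discrete-data approximation
        alpha = 1 + n / sum(ln(k_i / (k_min - 0.5))),
    which is intended for k_min that is not too small.
    Returns (alpha, n), where n is the number of tail observations.
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(k / (k_min - 0.5)) for k in tail)
    return alpha, n

# Degrees below k_min are ignored; here ten nodes sit exactly at k_min = 34.
alpha, n_tail = powerlaw_alpha([34] * 10 + [5, 6], k_min=34)
```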
Figure 8: Nearest neighbor happiness assortativity as a function of the number
of labMT words required per user is displayed for a sample week
reciprocal-reply network. Notice that when Δh = 0, there is less variation due to the
numerous words centered around the mean happiness score regardless of the
threshold, α. While this stability is desirable, tuning Δh allows us to sharpen
the resolution of the hedonometer. This tuning, however, must be balanced with
the appropriate choice of α.



3.2. Measuring happiness 

The application of the hedonometer gives reasonable results
when applied to a large body of text, but can be misleading
when applied to smaller units of language [11]. To provide a
sense of how sensitive this measure is to the number of labMT
words posted by users, we sampled happiness-happiness pairs,
(h_{v_i}, h_{v_j}), whose respective users, v_i and v_j, had posted at least
α total labMT words during a sample week (week beginning
January 27, 2009). For these users, we compute happiness
assortativity and show the variation with α in Figure 8. For
Δh = 0, there is less variation due to the numerous words
centered around the mean happiness score regardless of the
threshold, α. Tuning both parameters too high results in few sampled
words and corrupts the interpretation of the results.

Figures 9 and 10 reveal a weakening happiness-happiness
correlation for users in the week networks as the path length 
between nodes increases. All correlations, for each week, were 
significant (p < 10"^^). This suggests that the network is assor- 
tative with respect to happiness and that user happiness is more 
similar to their nearest neighbors than those who are 2 or 3 links 
away. 



In Figure 11 we provide a visualization of an ego-network 
for a single node, including neighbors up to three links away. 
Nodes are colored by their h_avg score, illustrating the assortativity
of happiness. Figure A5 visualizes the happiness assortativity
for an entire week network.
Figure 9: Average assortativity of happiness for week networks measured by
Spearman's correlation coefficients as Δh is dialed from 0 to 2.5, with α = 50.
As Δh increases, the average correlation decreases. For large Δh the resulting
words under analysis have more disparate happiness scores and thus the
correlations between users' happiness scores are smaller. Similarly, choosing Δh to be
too small (e.g., Δh = 0) could result in an overestimate of happiness-happiness
correlations because of the uni-modal distribution of h_avg for the labMT words.
Thus a moderate value for Δh is chosen (Δh is set to 1 for this study).



Figure 10: Happiness assortativity as measured by Spearman's correlation coefficients is shown for week networks, with Δh = 1 and (a) the threshold of labMT
words written by users set to α = 1 and (b) α = 50. The dashed lines indicate weakening happiness-happiness correlations as the path length increases from one
to two and three links away, for each week in the data set.

Figure 11: A visualization of a user and its neighbors up to 3 links away for a week
beginning September 9, 2008 (Week 1). Colors represent happiness scores for
users posting more than α = 50 labMT words. Nodes depicted in
black are nodes for which the user's wordbag did not meet our thresholding
criteria.

In Figure 12 we show the average happiness score as a function
of user degree k for all week networks. The average happiness
score increases gradually as a function of degree, with
large degree nodes demonstrating a larger average happiness
than small degree nodes. Large degree nodes use words such
as "you," "thanks," and "lol" more frequently than small degree
nodes, while the latter group uses words such as "damn," "hate,"
and "tired" more frequently. A word shift diagram comparing
nodes with k < 100 vs. nodes with k > 100 is included in the
Appendix (Fig. A7). Figure 12 also reveals that the number of
large degree nodes is fairly small. Our results support recent
work showing that most users of Twitter exhibit an upper limit
on the number of active interactions in which they can be
engaged [31]. This may provide further evidence in support of
Dunbar's hypothesis, which suggests that the number of
meaningful interactions one can have is near 150 [42].
Figure 12: Top Panel: The average happiness score as a function of user degree
k for week networks is increasing, as larger degree nodes use fewer negative
words (see Figure A7). Bottom Panel: The number of unique users is reported
with respect to degree k; some users appear in more than one bin because they
exhibit different degree k for different weeks of the study.






3.3. Testing assortativity against a null model

To further examine these findings, we create a null model
which maintains the network topology (i.e., the adjacency
matrices for one, two, and three links remain intact), but
randomly permutes the happiness scores associated with each
node. The Spearman correlation coefficient shows no
statistically significant relationship for the null model applied
to a sample week of the data set. Figure 13 shows the results of
100 random permutations applied to nodes' associated happiness
scores. The Spearman correlation coefficients for the observed
data are shown as blue squares (Δh_avg = 0) and green
diamonds (Δh_avg = 1). The average and standard deviation of
the Spearman correlation coefficient calculated for the 100
randomized happiness scores (null model) are shown as red circles
with error bars (the error bars are smaller than the symbol).
These data support the hypothesis that happiness is less
assortative as network distance increases.

Lastly, we explore whether these correlations are due to
similarity of word usage. For this analysis, we compute the
similarity of word bags for users connected in the reciprocal reply
networks. We compare the distribution of observed similarity
scores to similarity scores obtained by randomly reassigning
word bags to users. Figure A8 shows that both distributions
are of a similar form, with the randomized version exhibiting
a slightly lower mean similarity score (D_ij = .167) as
compared to the mean of the observed similarity scores for users
(D_ij = .267). If users were tweeting similar words with a
similar frequency, we would expect a much larger mean similarity
score for the observed data. Thus, we do not find evidence
suggesting that the happiness correlations are due to similarity of
word bags.

Figure 13: One hundred random permutations were applied to the happiness
scores associated with each node in a sample week network (week beginning
October 8, 2008 is shown), with Δh = 0 (blue squares) and Δh = 1 (green
diamonds). The threshold for all cases is set to α = 50. The Spearman correlation
coefficients for the observed data are shown as blue squares. The average
and standard deviation of the Spearman correlation coefficient calculated for
the 100 randomized data (null model) are shown as red circles with error bars
(the error bars are smaller than the symbol). The plot shows Spearman
correlation coefficients for the null model to be nearly 0 and provides supporting
evidence for our observed trend, namely the network is assortative with respect
to happiness and the strength of assortativity decreases as path length increases.

4. Discussion

In this paper, we describe how a social sub-network of Twitter
can be derived from reciprocal replies. Countering claims
that Twitter is not a social network ifTSl, we provide evidence
of a very social Twitter. The large volume of replies (millions
every week) and the assortativity of user happiness indicate that
Twitter is being used as a social service. Furthermore,
conducted at the level of weeks, our analysis examines an
in-the-moment social network, rather than the stale accumulation of
social ties over a longer period of time. A network in which
edges are created and never disintegrate results in dead links
with no contemporary functional activity. This problem of
unfriending has been noted [26] and can greatly impact
conclusions drawn when observational data are used to infer
contagion.

Our characterization of the reciprocal reply network reveals
several trends over the 25-week period from September 2008 to
February 2009. The number of nodes, N, in a given week
network increased as time progressed, which is undoubtedly due
to Twitter's enormous growth in popularity over the study
period. Similarly, with an increasing number of nodes, we observe
a smaller proportion of closed triangles (i.e., clustering shows
a slight decrease). This may be due in part to sub-sampling
effects or due to an increasing N, with which the number of
closed triangles (i.e., friends of friends) cannot keep up. The
proportion of nodes in the giant component remains fairly
constant, as does degree assortativity as measured by Spearman's
correlation coefficient. Had we used the Pearson correlation
coefficient, degree assortativity would have been highly variable
(Fig. A2) due to the extreme values of maximum degree (k_max)
during weeks 12-14 and 22. Using the Spearman rank
correlation coefficient, which is less sensitive to extreme values, we
find that the degree assortativity is fairly constant.
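
The sensitivity argument can be illustrated with a small, self-contained Python sketch. It builds the list of degree pairs found at the two ends of each edge and compares the Pearson coefficient with the Spearman coefficient (Pearson applied to tie-averaged ranks). All function names and the toy degree values are assumptions made for illustration; this is not the code used for the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            result[order[pos]] = shared
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def degree_pairs(edges):
    """Degrees at both ends of each undirected edge, in both directions."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    x, y = [], []
    for u, v in edges:
        x += [degree[u], degree[v]]
        y += [degree[v], degree[u]]
    return x, y


# Hypothetical degree pairs: five well-matched edges plus one edge that
# joins an extreme hub (degree 1000) to a degree-1 node.
x = [1, 2, 3, 4, 5, 1000]
y = [2, 3, 4, 5, 6, 1]
print("Pearson :", pearson(x, y))
print("Spearman:", spearman(x, y))
```

In this toy example a single extreme pair is enough to make the Pearson coefficient strongly negative, while the rank-based coefficient stays positive.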

Our work is based on a sub-sample of tweets and is thus
subject to the effects of missing data. The problem of
missing data has been addressed by several researchers
investigating the impact of missing nodes B31j47l, missing links, or both
[48]. More specifically, the work of Stumpf f4T| shows that
sub-sampled scale-free networks are not necessarily themselves
scale-free. Further work which addresses the problem of
missing messages and identifies the consequences of missing data
on inferred network topology is needed to more fully address
these questions.

We find support for the "happiness is assortative" hypothesis
and evidence that these correlations can be detected up to three
links away. Further, this finding does not appear to be based on
users tweeting similar words (Fig. A8). Our correlation
coefficients for reciprocal-reply networks constructed at the level of
weeks are smaller than those obtained by Bollen et al. (H for a
reciprocal-follower network constructed by aggregating over a
six-month period. This difference is likely a reflection of
differences in methodologies, such as our more dynamic time scale
(one-week periods vs. six-month periods), our exclusion of
central value happiness scores (i.e., stop words), and our use of the
Spearman correlation coefficient.
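
The permutation null model of Section 3.3 can be sketched in a few lines of Python. The sketch keeps the edge list fixed, shuffles the happiness scores among the nodes, and recomputes the Spearman correlation over connected pairs. The helper names, the default of 100 permutations per call, and the toy path network are illustrative assumptions; to test pairs two or three links apart, the same functions could be given a list of such pairs instead of direct edges.

```python
import math
import random


def _ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        for pos in range(start, end + 1):
            result[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def edge_correlation(edges, score):
    """Spearman correlation of scores across the ends of each edge."""
    x, y = [], []
    for u, v in edges:
        if u in score and v in score:  # skip nodes without a score
            x += [score[u], score[v]]
            y += [score[v], score[u]]
    return spearman(x, y)


def null_model(edges, score, n_perm=100, seed=0):
    """Mean and standard deviation of the correlation under shuffling."""
    rng = random.Random(seed)
    nodes = list(score)
    values = [score[node] for node in nodes]
    results = []
    for _ in range(n_perm):
        rng.shuffle(values)  # topology is untouched, scores are permuted
        results.append(edge_correlation(edges, dict(zip(nodes, values))))
    mean = sum(results) / n_perm
    sd = math.sqrt(sum((r - mean) ** 2 for r in results) / n_perm)
    return mean, sd


# Toy network: a path whose scores increase along the path, so that
# neighbors have similar scores by construction.
edges = [(i, i + 1) for i in range(49)]
score = {i: float(i) for i in range(50)}
print("observed:", edge_correlation(edges, score))
print("null (mean, sd):", null_model(edges, score))
```

A large gap between the observed value and the null mean, measured in units of the null standard deviation, is the kind of evidence summarized in Figure 13.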

While this paper does not attempt to separate homophily and
contagion, future work could use reciprocal-reply networks to
investigate these effects. While reciprocal-reply networks are
subject to errors caused by missing data (see above discussion
of this issue), they may provide a valuable framework for
studying contagion effects, given that they are based on a
conservative and dynamic metric of what constitutes an interaction on
Twitter. A network structure in which links are known to be
active and valid provides an arena in which the diffusion of
information and emotion may be properly studied.
Figure A1: Log-log plot of the complementary cumulative distribution function (CCDF) of the degree distribution for a sample month (January 2009) network is
shown (blue), along with the best fitting power law model (α = 3.50 and k_min = 109) using the procedure of Clauset, Shalizi, and Newman [41]. We test whether the
empirical distribution is distinguishable from a power law using the Kolmogorov-Smirnov test and find no evidence against the null hypothesis (D = 1.82x 10~^, p =
0.35, n = 495881). This distribution may be fit by a power law.
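
As a rough illustration of the quantities named in this caption, the Python sketch below computes an empirical CCDF and a maximum-likelihood estimate of the power-law exponent for the tail k ≥ k_min. It uses the continuous-variable estimator, 1 + n / Σ ln(k_i / k_min), which is only an approximation for integer degrees, and it takes k_min as given rather than selecting it. The function names and the synthetic sample are assumptions for illustration; the full fitting procedure and goodness-of-fit test referenced above involve additional steps not shown here.

```python
import math
import random


def ccdf(degrees):
    """Empirical CCDF as a list of (k, P(K >= k)) for each observed k."""
    n = len(degrees)
    counts = {}
    for k in degrees:
        counts[k] = counts.get(k, 0) + 1
    remaining = n
    points = []
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def alpha_mle(degrees, k_min):
    """Continuous-approximation MLE of the exponent for k >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)


def sample_power_law(n, alpha, k_min, seed=0):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    rng = random.Random(seed)
    return [k_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


# Synthetic check: samples generated with a known exponent should yield
# an estimate close to that exponent.
samples = sample_power_law(20000, alpha=3.5, k_min=109, seed=1)
print("estimated exponent:", alpha_mle(samples, k_min=109))
print("first CCDF points:", ccdf([1, 1, 2, 3]))
```

Plotting the CCDF points on logarithmic axes gives the kind of curve shown in Figure A1; a straight tail is suggestive of, but does not by itself establish, a power law.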
Figure A2: Spearman and Pearson correlation coefficients are used to measure degree assortativity. The Pearson correlation coefficient is more sensitive to extreme
values. As a result, the Pearson correlation coefficient obscures the trend that the network is assortative with respect to the rank of node degrees. Given the nature 
of the degree distribution and the questions that we are asking, we use the Spearman correlation coefficient for our study. 
Figure A3: Measured happiness assortativity as threshold for labMT word usage increases for a single week network. The Spearman correlation coefficient (right)
exhibits less variability as compared to the Pearson correlation coefficient (left). Notice that when Δh = 0, there is less variation due to the numerous words centered
around the mean happiness score, regardless of the threshold, α.
Figure A4: A visualization of the reciprocal reply network for the week beginning September 9, 2008 (Week 1) is depicted. The size of a node is proportional to the
degree, and colors further emphasize the degree detected by Gephi's implementation of the algorithm suggested by Blondel et al. [40].
Figure A5: A visualization of the reciprocal reply network for the week beginning September 9, 2008 (Week 1). Colors represent happiness scores for nodes with
greater than α = 50 labMT words (57% of all nodes in the week). The visualization was produced using Gephi [39]. The algorithm employed by the software
clusters nodes according to their connectivity. Collections of nodes with similar colors provide a visualization of the "happiness is assortative" finding.
T_ref: small degree nodes (h_avg = 6.05); T_comp: large degree nodes (h_avg = 6.11)

Figure A7: The collection of words used by nodes with degree k > 100 (T_comp) is compared to words written by users with degree k < 100 (T_ref). Words "award"
and "awards" were excluded because their usage dominated the word bag of one high degree node (Twitter's Shorty Awards). The word "die" was also excluded to
eliminate the possibility of the hedonometer incorrectly being applied to German tweets. We note that removal of these words resulted in a negligible change to
all values of h_avg reported in the paper. The horizontal bars on the right side of the plot represent words which raise the happiness score of T_comp. The symbols
+/- and ↑/↓ combine to convey whether a positive/negative word appears more/less frequently in T_comp as compared to T_ref. Notice that an increase in
the usage of positive words (e.g., "you"), as well as a decrease in the use of a negative word (e.g., "last"), will contribute to T_comp having a higher happiness score.
On the left hand side of the word shift plot are words which contribute to lowering the happiness score of T_comp. Such examples include an increase in the usage
of negative words (e.g., "not") as well as a decrease in the usage of positive words (e.g., "home"). The magnitude of the bars indicates the relative contribution of
each word to these effects. In summary, we see that T_comp has a higher happiness score than does T_ref. In the lower right, the relative text sizes are depicted as
rectangles proportional to the number of words. The reference text, T_ref, has considerably more words in its collection than does T_comp. The circle plots depict
the relative amount of positive vs. negative words contained in T_ref and T_comp. While both collections are similar in terms of positive word usage, the collection of
words used by larger nodes contains fewer negative words, and this contributes to the slightly higher happiness score for this collection of words. The lower
left inset shows the cumulative sum of individual word contributions as a function of log10 r, where r is the rank of the 3,686 labMT words. See [11] for the full
details of the word shift graph.
Figure A8: The similarity of word bags for pairs of users connected in a week reciprocal reply network is computed as follows: For users i and j, we compute
$D_{ij} = 1 - \frac{1}{2} \sum_{n} |f_{i,n} - f_{j,n}|$, where $f_{i,n}$ represents the normalized frequency of word usage of the nth labMT word by user i. The value of D_ij ranges from
0 (dissimilar word bags) to 1 (similar word bags). The proportion of occurrences of user-user pairs in the reciprocal reply network for a sample week (Sept. 16,
2008) having word similarity indices between 0 and 1 are shown (blue dots), with α = 50 and Δh = 1. The majority of user-user similarity indices are less than
0.4, indicating that users and their nearest neighbors use dissimilar collections of words in their tweets. We then perform 100 random permutations of word vector
assignments to users, while holding the network topology intact (black squares). The resulting distributions show that while users are using more similar words than
would be expected by chance, this shift is small. The mean score for randomized user-user paired word collections is D_ij = .161. This value is not zero, since users
are using a common language (English). The mean score for our observed network data is D_ij = .261, which is slightly higher than the randomized value due to
conversations occurring between these users.
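
The similarity index and its randomized baseline can be sketched as follows. The code computes D_ij from two word-count dictionaries, averages it over the edges of a network, and provides a helper that reassigns word bags to users at random while leaving the edges unchanged. Function names and the toy word bags are hypothetical, and restricting the vocabulary to labMT words is left to the caller.

```python
import random


def normalized_frequencies(word_counts):
    """Convert raw word counts into frequencies that sum to 1."""
    total = sum(word_counts.values())
    return {word: count / total for word, count in word_counts.items()}


def word_bag_similarity(counts_i, counts_j):
    """D_ij = 1 - (1/2) * sum_n |f_i,n - f_j,n|, between 0 and 1."""
    f_i = normalized_frequencies(counts_i)
    f_j = normalized_frequencies(counts_j)
    vocabulary = set(f_i) | set(f_j)
    distance = sum(abs(f_i.get(w, 0.0) - f_j.get(w, 0.0)) for w in vocabulary)
    return 1.0 - 0.5 * distance


def mean_edge_similarity(edges, bags):
    """Average D_ij over all connected user pairs."""
    return sum(word_bag_similarity(bags[u], bags[v]) for u, v in edges) / len(edges)


def shuffled_bags(bags, rng):
    """Randomly reassign word bags to users; the edge list is untouched."""
    users = list(bags)
    values = [bags[user] for user in users]
    rng.shuffle(values)
    return dict(zip(users, values))


# Toy example with three hypothetical users.
bags = {
    "u1": {"love": 2, "home": 1},
    "u2": {"love": 1, "home": 1, "tired": 1},
    "u3": {"snow": 3},
}
edges = [("u1", "u2"), ("u2", "u3")]
print("observed mean:", mean_edge_similarity(edges, bags))
rng = random.Random(0)
print("one shuffle  :", mean_edge_similarity(edges, shuffled_bags(bags, rng)))
```

Repeating the shuffle many times and averaging the resulting means gives a randomized baseline comparable in spirit to the black squares in Figure A8.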
Rank  Word        Frequency  Happiness   Rank  Word        Frequency  Happiness   Rank  Word        Frequency  Happiness
                     (x10^)                                   (x10^)                                   (x10^)
   1  you            103.55       6.24     51  sure             6.82       6.32    101  music            4.24       8.02
   2  my              94.91       6.16     52  done             6.81       6.54    102  found            4.23       6.54
   3  me              56.35       6.58     53  show             6.73       6.24    103  doesn't          4.23       3.62
   4  not             39.98       3.86     54  awesome          6.72          ?    104  online           4.23       6.72
   5  up              36.04       6.14     55  check            6.51       6.10    105  party            4.20       7.58
   6  no              34.40       3.48     56  bed              6.42       7.1?    106  soon             4.20       6.34
   7  new             34.03       6.82     57  sleep            6.33       7.??    107  thinking         4.15       6.28
   8  like            31.75       7.22     58  cool             6.32       7.20    108  snow             4.14       6.32
   9  all             30.71       6.22     59  live             6.29       6.?4    109  give             4.13       6.54
  10  good            30.20       7.20     60  big              6.28       6.22    110  movie            4.12       6.84
  11  will            23.58       6.02     61  free             6.18       7.96    111  ha               4.09       6.00
  12  we              22.59       6.38     62  life             6.17       7.32    112  sorry            4.08       3.66
  13  day             21.80       6.24     63  old              6.07       3.9?    113  real             4.06       6.78
  14  know            19.45       6.10     64  didn't           6.04          ?    114  kids             3.98       7.38
  15  more            19.32       6.24     65  find             6.00       6.00    115  phone            3.91       6.44
  16  don't           18.29       3.70     66  die              6.00       1.74    116  tv               3.91       6.70
  17  today           18.24       6.22     67  video            5.99       6.4?    117  stop             3.89       3.90
  18  love            17.66       8.42     68  house            5.99       6.34    118  play             3.88       7.26
  19  think           17.45       6.20     69  Christmas        5.89       7.96    119  waiting          3.88       3.68
  20  see             15.28       6.06     70  playing          5.77       7.14    120  lunch            3.81       7.42
  21  great           14.60       7.88     71  world            5.76       6.52    121  food             3.79       7.44
  22  lol             13.35       6.84     72  game             5.54       6.92    122  reading          3.76       6.78
  23  thanks          13.09       7.40     73  wow              5.54       7.46    123  god              3.74       7.28
  24  home            13.05       7.14     74  ready            5.53       6.5?    124  top              3.65       6.76
  25  people          12.71       6.16     75  iphone           5.53       6.54    125  buy              3.60       6.28
  26  night           12.70       6.22     76  listening        5.41       6.2?    126  book             3.56       7.24
  27  blog            12.26       6.02     77  pretty           5.40          ?    127  car              3.56       6.72
  28  last            11.89       3.74     78  always           5.39       6.4?    128  idea             3.52       7.06
  29  well            11.70       6.68     79  help             5.27       6.0?    129  friend           3.51       7.66
  30  make            11.27       6.00     80  read             5.07       6.52    130  family           3.51       7.72
  31  right           11.04       6.54     81  google           5.05          ?    131  yay              3.47       6.10
  32  can't           10.93       3.42     82  everyone         5.03       ?.12    132  glad             3.47       7.48
  33  morning         10.38       6.56     83  most             4.95       6.22    133  least            3.46       4.00
  34  very            10.10       6.12     84  wait             4.88          ?    134  nothing          3.44       3.90
  35  first            9.69       6.82     85  start            4.87       6.10    135  late             3.43       3.46
  36  our              9.26       6.08     86  please           4.79       6.36    136  internet         3.39       7.48
  37  better           8.89       7.00     87  con              4.78       3.70    137  amazing          3.38       7.66
  38  us               8.82       6.26     88  try              4.77       6.02    138  mean             3.38       3.68
  39  tonight          8.79       6.14     89  thought          4.69       6.3?    139  myself           3.37       6.30
  40  down             8.73       3.66     90  school           4.66       6.26    140  facebook         3.34       6.08
  41  happy            8.40       8.30     91  thank            4.64          ?    141  funny            3.32       7.92
  42  tomorrow         7.88       6.18     92  weekend          4.56          ?    142  tired            3.29       3.34
  43  nice             7.80       7.38     93  hey              4.48          ?    143  talk             3.29       6.06
  44  best             7.61       7.18     94  wish             4.44       6.92    144  damn             3.26       2.98
  45  she              7.57       6.18     95  hate             4.42       2.34    145  interesting      3.26       7.52
  46  yes              7.42       6.74     96  haha             4.41       7.64    146  own              3.24       6.16
  47  fun              7.37       7.96     97  friends          4.41       7.92    147  friday           3.23       6.88
  48  hope             7.34       7.38     98  making           4.40       6.24    148  open             3.18       6.10
  49  bad              6.98       2.64     99  dinner           4.27       7.40    149  lost             3.16       2.76
  50  never            6.92       3.34    100  coffee           4.27       7.18    150  guys             3.16       6.22

Table A1: The top 150 most frequently occurring words from the labMT list in our Sept 2008 through Feb 2009 data set, where stop words (4 < h_avg < 6) have been
removed.
Rank  Word        Frequency  Happiness   Rank  Word        Frequency  Happiness   Rank  Word        Frequency  Happiness
                     (x10^)                                   (x10^)                                   (x10^)
   1  the            295.60       4.98     51  an              22.73       4.84    101  el              11.76       4.80
   2  to             249.91       4.98     52  we              22.59       6.38    102  well            11.70       6.68
   3  i              221.28       5.92     53  some            22.32       5.02    103  oh              11.69       4.84
   4  a              218.13       5.24     54  que             22.26       4.64    104  who             11.64       5.06
   5  and            135.23       5.22     55  day             21.80       6.24    105  should          11.48       5.24
   6  is             127.94       5.18     56  how             21.64       4.68    106  over            11.34       4.82
   7  in             122.94       5.50     57  going           20.64       5.42    107  make            11.27       6.00
   8  of             121.79       4.94     58  am              20.60       5.38    108  then            11.15       5.34
   9  for            114.41       5.22     59  go              20.03       5.54    109  right           11.04       6.54
  10  you            103.55       6.24     60  has             19.68       5.18    110  can't           10.93       3.42
  11  on              96.97       5.56     61  or              19.55       4.98    111  way             10.84       5.24
  12  my              94.91       6.16     62  know            19.45       6.10    112  only            10.72       4.92
  13  it              91.09       5.02     63  more            19.32       6.24    113  getting         10.63       5.68
  14  that            69.81       4.94     64  la              18.77       5.00    114  his             10.56       5.56
  15  at              58.51       4.90     65  don't           18.29       3.70    115  morning         10.38       6.56
  16  with            56.42       5.72     66  today           18.24       6.22    116  very            10.10       6.12
  17  me              56.35       6.58     67  too             18.15       5.22    117  after            9.82       5.08
  18  just            50.25       5.76     68  they            18.09       5.62    118  watching         9.76       5.84
  19  have            49.86       5.82     69  work            17.95       5.24    119  her              9.73       5.84
  20  be              46.10       5.68     70  got             17.91       5.60    120  them             9.71       4.92
  21  this            45.75       5.06     71  love            17.66       8.42    121  first            9.69       6.82
  22  de              44.38       4.82     72  think           17.45       6.20    122  e                9.66       4.72
  23  so              40.93       5.08     73  back            17.37       5.18    123  that's           9.55       5.28
  24  not             39.98       3.86     74  twitter         17.18       5.46    124  rt               9.52       4.88
  25  i'm             39.89       5.74     75  when            16.84       4.96    125  y                9.47       4.48
  26  are             39.03       5.16     76  there           16.39       5.10    126  than             9.42       4.74
  27  but             37.78       4.24     77  had             15.30       4.74    127  its              9.36       4.96
  28  was             37.74       4.60     78  see             15.28       6.06    128  our              9.26       6.08
  29  up              36.04       6.14     79  en              14.97       4.84    129  better           8.89       7.00
  30  out             35.20       4.62     80  really          14.93       5.84    130  us               8.82       6.26
  31  now             35.12       5.90     81  off             14.89       4.02    131  tonight          8.79       6.14
  32  no              34.40       3.48     82  great           14.60       7.88    132  down             8.73       3.66
  33  new             34.03       6.82     83  need            14.45       4.84    133  i've             8.59       5.58
  34  do              33.96       5.76     84  he              14.34       5.42    134  u                8.40       5.52
  35  from            33.78       5.18     85  still           13.74       5.14    135  happy            8.40       8.30
  36  like            31.75       7.22     86  been            13.43       5.04    136  again            8.34       5.42
  37  your            31.43       5.60     87  lol             13.35       6.84    137  could            8.34       5.52
  38  all             30.71       6.22     88  would           13.15       5.38    138  un               8.15       4.64
  39  good            30.20       7.20     89  thanks          13.09       7.40    139  into             8.08       5.04
  40  get             30.04       5.92     90  home            13.05       7.14    140  i'll             8.05       5.38
  41  what            29.46       4.80     91  want            12.81       5.70    141  man              7.99       5.90
  42  about           28.97       5.16     92  people          12.71       6.16    142  tomorrow         7.88       6.18
  43  it's            27.14       4.88     93  night           12.70       6.22    143  nice             7.80       7.38
  44  if              25.21       4.66     94  here            12.28       5.48    144  any              7.70       5.22
  45  by              24.66       4.98     95  ?               12.26       4.96    145  take             7.63       5.18
  46  as              24.50       5.22     96  blog            12.26       6.02    146  best             7.61       7.18
  47  time            24.19       5.74     97  why             12.10       4.98    147  she              7.57       6.18
  48  one             23.73       5.40     98  much            11.92       5.74    148  even             7.42       5.58
  49  will            23.58       6.02     99  last            11.89       3.74    149  yes              7.42       6.74
  50  can             23.57       5.62    100  did             11.84       5.58    150  little           7.38       4.60

Table A2: The top 150 most frequently occurring words from the labMT word list in our Sept 2008 through Feb 2009 data set including stop words.
Week  Start date         N     <k>   k_max     C_G   Assort   # Comp.       S
   1  09.09.08           ?       ?       ?    0.10     0.24         ?    0.71
   2  09.16.08           ?       ?       ?    0.10     0.24         ?    0.71
   3  09.23.08           ?       ?       ?       ?        ?         ?    0.70
   4  09.30.08           ?       ?       ?       ?        ?         ?       ?
   5  10.07.08           ?       ?     241       ?     0.21     11140       ?
   6  10.14.08      122644    3.20     3?4    0.09     0.14     12221    0.74
   7  10.21.08      130027    3.30     559    0.08     0.09     12420    0.75
   8  10.28.08      144036    3.56     492    0.08     0.14     12319    0.78
   9  11.04.08      145346    3.54     330    0.08     0.19     12597    0.78
  10  11.11.08      136534    3.35     441    0.08     0.12     12972    0.76
  11  11.18.08      153486    3.46     444    0.08     0.13     13594    0.77
  12  11.25.08      155753    3.46    1244    0.06     0.00     14122    0.77
  13  12.02.08      165156    3.44    1245    0.06     0.01     14496    0.78
  14  12.09.08      162445    3.33    1456    0.05     0.01     15342    0.76
  15  12.16.08      148154    3.12     730    0.06     0.04     15645    0.73
  16  12.23.08      140871    3.22     575    0.07     0.07     15216    0.72
  17  12.30.08      143015    3.30     519    0.07     0.15     15272    0.73
  18  01.06.09      170597    3.19     253    0.07     0.18     17234    0.74
  19  01.13.09      188429    3.29     477    0.07     0.13     18403    0.75
  20  01.20.09      196038    3.16     680    0.06     0.04     19927    0.74
  21  01.27.09      203852    3.04     973    0.05     0.01     21537    0.73
  22  02.03.09      212513    2.92    1718    0.04    -0.01     24387    0.71
  23  02.10.09      213936    2.83     828    0.06     0.02     25854    0.70
  24  02.17.09      215172    2.65     437    0.06     0.07     28742    0.67
  25  02.24.09      170180    2.27     320    0.06     0.04     28388    0.58

Table A3: Network statistics for reciprocal-reply networks by week. As Twitter popularity grows, so does the number of users (N) in the observed reciprocal-reply
network. The average degree (<k>), degree assortativity, the number of nodes in the giant component (# Comp.), and the proportion of nodes in the giant
component (S) remain fairly constant, whereas the maximum degree (k_max) shows a great deal of variability from month to month. Clustering (C_G) shows a slight
decrease over the course of this period.
Week  Start date   # Obsvd. Msgs   # Total Msgs   % Obsvd.                  # Replies   % Replies
                         (x10^6)        (x10^6)   (# Obsvd./# Total x 100)    (x10^6)   (# Replies/# Obsvd. x 100)
   1  09.09.08              3.14           7.26       43.2                       0.88        28.1
   2  09.16.08              3.36           8.31       40.4                       0.90        26.9
   3  09.23.08              3.43           8.89       38.6                       0.90        26.2
   4  09.30.08              3.33           9.06       36.8                       0.89        26.6
   5  10.07.08              2.33           9.38       24.8                       0.64        27.5
   6  10.14.08              4.39           9.87       44.4                       1.24        28.3
   7  10.21.08              4.70          10.01       47.0                       1.35        28.8
   8  10.28.08              5.74          10.34       55.5                       1.64        28.5
   9  11.04.08              5.58          11.14       50.1                       1.63        29.3
  10  11.11.08              4.70           9.88       47.6                       1.42        30.2
  11  11.18.08              5.48          11.34       48.3                       1.67        30.5
  12  11.25.08              5.71          11.47       49.8                       1.73        30.2
  13  12.02.08              5.54          12.85       43.1                       1.80        32.4
  14  12.09.08              5.41          13.54       39.9                       1.72        31.7
  15  12.16.08              4.57          12.72       35.9                       1.45        31.8
  16  12.23.08              4.80          11.62       41.3                       1.46        30.5
  17  12.30.08              4.61          13.48       34.2                       1.50        32.5
  18  01.06.09              5.16          16.11       32.0                       1.72        33.3
  19  01.13.09              5.73          17.33       33.1                       1.97        34.4
  20  01.20.09              5.82          18.87       30.9                       1.98        34.1
  21  01.27.09              5.75          20.79       27.6                       1.98        34.5
  22  02.03.09              5.78          22.42       25.8                       2.01        34.8
  23  02.10.09              5.66          23.39       24.2                       1.99        35.1
  24  02.17.09              5.43          25.71       21.1                       1.91        35.1
  25  02.24.09              3.80          20.75       18.3                       1.34        35.1

Table A4: The number of "observed" messages in our database comprises a fraction of the total number of Twitter messages made during the period of this study
(September 2008 through February 2009). While our feed from the Twitter API remains fairly constant, the total # of tweets grows, thus reducing the % of all
tweets observed in our database. We calculate the total # of messages as the difference between the last message id and the first message id that we observe for
a given month. This provides a reasonable estimation of the number of tweets made per month as message ids were assigned (by Twitter) sequentially during the
time period of this study. We also report the number of observed messages that are replies to specific messages and the percentage of our observed messages which
constitute replies.
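
The bookkeeping described in this caption amounts to simple arithmetic, sketched below in Python. The function names and the example message ids are hypothetical. The Week 1 counts are read from Table A4 under the assumption that its count columns are in millions, which is consistent with the roughly 40 million message pairs reported for the whole study.

```python
def estimate_total_messages(first_id, last_id):
    """Estimate messages posted in a window from sequential message ids."""
    return last_id - first_id


def percent(part, whole):
    """Express `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Hypothetical ids chosen so that the window spans 7.26 million messages.
total = estimate_total_messages(first_id=1_000_000_000, last_id=1_007_260_000)

# Week 1 counts from Table A4, assumed to be in millions.
observed = 3.14e6
replies = 0.88e6

print("estimated total:", total)
print("% observed     :", round(percent(observed, total), 1))
print("% replies      :", round(percent(replies, observed), 1))
```

Because the table entries are rounded, percentages recomputed from them may differ from the tabulated percentages in the last digit.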